6 research outputs found

    Hardware-conscious Query Processing in GPU-accelerated Analytical Engines

    In order to improve their power efficiency and computational capacity, modern servers are adopting hardware accelerators, especially GPUs. Modern analytical DBMS engines have been highly optimized for multi-core multi-CPU query execution, but lack the necessary abstractions to support concurrent hardware-conscious query execution over multiple heterogeneous devices and are thus unable to take full advantage of the available accelerators. In this work, we present a Heterogeneity-conscious Analytical query Processing Engine (HAPE), a hardware-conscious analytical engine that targets efficient concurrent multi-CPU multi-GPU query execution. HAPE decomposes heterogeneous query execution into i) efficient single-device and ii) concurrent multi-device query execution. It uses hardware-conscious algorithms designed for single-device execution and combines them, via code generation, into efficient intra-device execution modules. HAPE then combines these modules to achieve concurrent multi-device execution by handling data and control transfers. We validate our design by building a prototype and evaluating its performance on a co-processed radix join and on TPC-H queries. We show that it achieves up to 10x and 3.5x speed-ups on the join over CPU and GPU alternatives, respectively, and 1.6x-8x speed-ups on the queries over state-of-the-art CPU- and GPU-based commercial DBMSs.
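
    The decomposition the abstract describes can be pictured with a short sketch. The C++ below is illustrative only and not taken from HAPE itself: ExecModule, CpuModule, GpuModule, and run_concurrently are hypothetical names, and a real engine would generate such glue via code generation rather than virtual dispatch.

    #include <cstddef>
    #include <future>
    #include <memory>
    #include <vector>

    struct Batch { const void* data; size_t rows; };  // a chunk of columnar input

    // i) Single-device execution: each module is a hardware-conscious
    // pipeline tuned for one device type.
    struct ExecModule {
        virtual ~ExecModule() = default;
        virtual void execute(const Batch& in) = 0;
    };

    struct CpuModule : ExecModule {
        void execute(const Batch& in) override { /* vectorized CPU pipeline */ }
    };

    struct GpuModule : ExecModule {
        void execute(const Batch& in) override { /* host-to-device copy + kernels */ }
    };

    // ii) Concurrent multi-device execution: a coordinator hands batches to
    // the modules, hiding data and control transfers from the operators.
    void run_concurrently(std::vector<std::unique_ptr<ExecModule>>& modules,
                          const std::vector<Batch>& batches) {
        std::vector<std::future<void>> inflight;
        for (size_t i = 0; i < batches.size(); ++i) {
            ExecModule* m = modules[i % modules.size()].get();  // naive round-robin
            inflight.push_back(std::async(std::launch::async,
                [m, &batches, i] { m->execute(batches[i]); }));
        }
        for (auto& f : inflight) f.get();  // wait for all devices to drain
    }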

    HetExchange: Encapsulating heterogeneous CPU-GPU parallelism in JIT compiled engines

    Modern server hardware is increasingly heterogeneous as hardware accelerators, such as GPUs, are used together with multicore CPUs to meet the computational demands of modern data analytics workloads. Unfortunately, the query parallelization techniques used by analytical database engines are designed for homogeneous multicore servers, where query plans are parallelized across CPUs to process data stored in cache-coherent shared memory. These techniques are thus unable to fully exploit available heterogeneous hardware, where one needs to exploit the task parallelism of CPUs and the data parallelism of GPUs for processing data stored in a deep, non-cache-coherent memory hierarchy with widely varying access latencies and bandwidths. In this paper, we introduce HetExchange, a parallel query execution framework that encapsulates the heterogeneous parallelism of modern multi-CPU multi-GPU servers and enables the parallelization of (pre-)existing sequential relational operators. In contrast to the interpreted nature of the traditional Exchange operator, HetExchange is designed to be used in conjunction with JIT-compiled engines in order to allow tight integration with the proposed operators and the generation of efficient code for heterogeneous hardware. We validate the applicability and efficiency of our design by building a prototype that can operate over both CPUs and GPUs and enables its operators to be parallelism- and data-location-agnostic. In doing so, we show that efficiently exploiting CPU-GPU parallelism can provide 2.8x and 6.4x performance improvements compared to state-of-the-art CPU-based and GPU-based DBMSs.
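
    To make the encapsulation idea concrete, here is a minimal, hypothetical sketch of the classic exchange pattern that HetExchange generalizes: a producer routes batches onto per-consumer queues, so the operators on either side remain parallelism- and data-location-agnostic. All names are invented for illustration; the actual framework JIT-compiles this machinery and routes with locality awareness rather than round-robin.

    #include <condition_variable>
    #include <deque>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct Batch { int id; };

    // A simple blocking queue; one per consumer pipeline.
    class Queue {
        std::deque<Batch> q_;
        std::mutex m_;
        std::condition_variable cv_;
        bool closed_ = false;
    public:
        void push(Batch b) {
            { std::lock_guard<std::mutex> l(m_); q_.push_back(b); }
            cv_.notify_one();
        }
        void close() {
            { std::lock_guard<std::mutex> l(m_); closed_ = true; }
            cv_.notify_all();
        }
        bool pop(Batch& out) {  // returns false once drained and closed
            std::unique_lock<std::mutex> l(m_);
            cv_.wait(l, [&] { return !q_.empty() || closed_; });
            if (q_.empty()) return false;
            out = q_.front(); q_.pop_front();
            return true;
        }
    };

    int main() {
        const int kConsumers = 4;  // e.g. CPU worker threads and GPU streams
        std::vector<Queue> queues(kConsumers);
        std::vector<std::thread> consumers;
        for (int c = 0; c < kConsumers; ++c)
            consumers.emplace_back([&queues, c] {
                Batch b;
                while (queues[c].pop(b)) { /* run the compiled pipeline on b */ }
            });
        for (int i = 0; i < 64; ++i)            // producer side: route batches;
            queues[i % kConsumers].push({i});   // a real router is locality-aware
        for (auto& q : queues) q.close();
        for (auto& t : consumers) t.join();
    }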

    Hardware-conscious Hash-Joins on GPUs

    Traditionally, analytical database engines have used the task parallelism provided by modern multi-socket multicore CPUs to scale query execution. Over the past few years, GPUs have started gaining traction as accelerators for processing analytical queries due to their massively data-parallel nature and high memory bandwidth. Recent work on designing join algorithms for CPUs has shown that carefully tuned join implementations that exploit the underlying hardware can outperform naive, hardware-oblivious counterparts and provide excellent performance on modern multicore servers. However, there has been no such analysis of hardware-conscious join algorithms for GPUs that systematically explores the dimensions of partitioning (partitioned versus non-partitioned joins), data location (data fitting versus not fitting in GPU device memory), and access pattern (skewed versus uniform). In this paper, we present the design and implementation of a family of novel, partitioning-based GPU join algorithms that are tuned to exploit various GPU hardware characteristics in order to work around the two main limitations of GPUs: limited memory capacity and a slow PCIe interface. Using a thorough evaluation, we show that: i) hardware-consciousness plays a key role in GPU joins, just as in CPU joins, and our join algorithms can process 1 billion tuples per second even when no data is GPU-resident; ii) radix-partitioning-based GPU joins that are tuned to exploit GPU hardware can substantially outperform non-partitioned hash joins; and iii) hardware-conscious GPU joins can effectively overcome GPU limitations and match, or even outperform, state-of-the-art CPU joins.
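
    As a rough illustration of the core technique, the sketch below shows one generic radix-partitioning pass (histogram, prefix sum, scatter) on the CPU. It is not the paper's GPU implementation, which tunes fan-out and buffering to GPU shared memory and PCIe transfer granularity; the function and field names are assumptions.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Tuple { uint32_t key; uint32_t payload; };

    // Split 'in' into 2^bits partitions by the low bits of the key, so that
    // each partition's hash table later fits in fast memory (on a GPU, the
    // per-SM shared memory; here, plain host memory).
    std::vector<Tuple> radix_partition(const std::vector<Tuple>& in, int bits,
                                       std::vector<size_t>& part_begin) {
        const size_t fanout = size_t{1} << bits, mask = fanout - 1;
        std::vector<size_t> hist(fanout, 0);
        for (const Tuple& t : in) ++hist[t.key & mask];   // pass 1: histogram
        part_begin.assign(fanout + 1, 0);
        for (size_t p = 0; p < fanout; ++p)               // exclusive prefix sum
            part_begin[p + 1] = part_begin[p] + hist[p];
        std::vector<Tuple> out(in.size());
        std::vector<size_t> cursor(part_begin.begin(), part_begin.end() - 1);
        for (const Tuple& t : in)                         // pass 2: scatter
            out[cursor[t.key & mask]++] = t;
        return out;  // the join then proceeds partition by partition
    }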

    High-performance Support Vector Methods on Data Streams

    Summarization: We are in an era where data are constantly being generated, and machine learning can benefit from this to produce better models. Support vector machines (SVMs) are a popular machine learning model that can be adapted and used for various tasks, such as classification, regression, and clustering. We study the problem of continuously updating L2 support vector machines in a distributed environment where new data constantly arrive at remote sites. We approach this as the problem of tracking a convex function's minimum over the convex hull of the union of fully dynamic sets, each located at one of the sites. We give communication-efficient solutions for both the exact and approximate variants of the problem and show that they are applicable in the case of a kernelized SVM trained in an explicit feature space. In our proposed methods, the sites communicate only when necessary, that is, whenever the model has truly become outdated. When the sites are forced to communicate, we propose two algorithms: one iterative, and one with a single stage of communication.
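
    A hedged sketch of the "communicate only when the model is truly outdated" idea follows. The local condition shown (an L2-SVM margin test on arriving points) and all names are assumptions for illustration, not the thesis's exact protocol, which tracks the minimum over convex hulls.

    #include <cstddef>
    #include <functional>
    #include <utility>
    #include <vector>

    struct Point { std::vector<double> x; double y; };  // label y in {-1, +1}

    double dot(const std::vector<double>& a, const std::vector<double>& b) {
        double s = 0;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    class Site {
        std::vector<double> w_;               // last model sent by the coordinator
        std::function<void()> request_sync_;  // asks for global retraining
    public:
        Site(std::vector<double> w, std::function<void()> sync)
            : w_(std::move(w)), request_sync_(std::move(sync)) {}

        // A new stream item arrives: stay silent while the cached model still
        // classifies it with margin >= 1; otherwise the global model may be
        // outdated, so trigger communication.
        void on_arrival(const Point& p) {
            if (p.y * dot(w_, p.x) < 1.0) request_sync_();
        }

        void install(std::vector<double> w) { w_ = std::move(w); }  // after a sync
    };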

    GPU-accelerated data management under the test of time

    GPUs are becoming increasingly popular in large-scale data center installations due to their strong, embarrassingly parallel processing capabilities. Data management systems are riding the wave by using GPUs to accelerate query execution, mainly for analytical workloads. However, this acceleration comes at the price of a slow interconnect, which imposes strong restrictions on bandwidth and latency when bringing data from main memory to the GPU for processing. Related research in data management systems mostly relies on late materialization and data sharing to mitigate the overheads introduced by slow interconnects, even in the standard CPU processing case. Meanwhile, workload trends are moving beyond analytical towards fresh data processing, typically referred to as Hybrid Transactional and Analytical Processing (HTAP). We therefore experience an evolution along three different axes: interconnect technology, GPU architecture, and workload characteristics. In this paper, we break the evolution of the technological landscape into steps and study the applicability and performance of late materialization and data sharing at each one. We demonstrate that the standard PCIe interconnect substantially limits the performance of state-of-the-art GPUs, and we propose a hybrid materialization approach which combines eager with lazy data transfers. Further, we show that the wide gap between GPU and PCIe throughput can be bridged through efficient data sharing techniques. Finally, we provide an H2TAP system design which removes software-level interference, and we show that the interference in the memory bus is minimal, allowing data transfer optimizations as in OLAP workloads.
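
    To illustrate what combining eager with lazy transfers might look like, the sketch below decides per column whether to ship it whole over PCIe up front or to gather only qualifying rows after a device-side filter. The cost model (a flat per-row gather penalty) and all names are assumptions for illustration, not the paper's actual policy.

    #include <cstddef>
    #include <cstdint>

    struct Column { const uint8_t* host; size_t width_bytes; size_t rows; };

    // Hybrid materialization decision: transfer the whole column eagerly,
    // or lazily fetch only the rows that survive the device-side filter?
    bool transfer_eagerly(const Column& c, double expected_selectivity) {
        const double eager_bytes = double(c.width_bytes) * c.rows;
        // Lazy transfers pay per-row gather and latency overheads; model
        // them here as a flat factor (the 4.0 below is a made-up constant).
        const double lazy_cost = 4.0 * c.width_bytes * c.rows * expected_selectivity;
        return eager_bytes <= lazy_cost;
    }

    Under this toy model, narrow columns and low-selectivity filters favor lazy fetching, while wide columns read in full favor eager transfer, which is the intuition behind mixing the two strategies.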